2024-08-12
If you don’t already have it, go to https://cran.r-project.org/ and install an appropriate version of R.
You can install RStudio from here: https://posit.co/download/rstudio-desktop/
Write commands in the source window (and save them when you’re done)
View output and results in the console
See what you’ve loaded or stored in the environment window
View and export plots from the output window (and manage packages from the packages tab)
Type the following in the source window:
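A minimal script matching the description that follows might look like this. The wording of the comment is an assumption; the three steps are the ones the text walks through.

```r
# generate some random data and plot it
x <- rnorm(100)
x
hist(x)
```

Because the numbers are random, the values printed in your console will differ from run to run.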
Click your cursor on the top line and press CTRL + ENTER (CMD + ENTER on a Mac). Your cursor should move down and the code will be evaluated. Take note of what happens and where things show up after executing each line.
Text after a # symbol is ignored by R, so this first line is just a comment. Use comments to remind yourself and others of what your code is doing.
The <- operator will assign a name to some data, and rnorm is a function that generates random data with a normal (bell-curve) distribution.
Here, I’ve created 100 random numbers and assigned them to a variable named x.
If I just reference the variable x, R will print its contents into the console.
[1] -1.32166157 -1.09041677 -0.17737279 -0.04363539 -0.93385626 0.31722219
[7] -0.64866531 1.59994063 0.78508201 -0.57353822 0.47568361 0.58762138
[13] -0.60230830 -0.97089061 0.67349350 0.58777949 -1.08545699 1.26911721
[19] 1.30279300 -1.21668076 0.57959259 -1.34244170 -1.41482697 -0.21515942
[25] -0.74964633 0.37733581 -0.09120799 0.97607295 -0.08940096 -2.07759765
[31] -1.19936552 -0.85392567 0.24280478 -2.75097627 1.49202394 0.03581691
[37] -0.36853097 0.38129608 -0.82444607 0.96084753 1.08421006 0.01584561
[43] 0.88713460 -0.45344669 -1.25060864 0.33638558 -1.81455065 -0.28571450
[49] -0.96454568 -1.68540171 -0.74442012 -0.72863844 -0.17307513 0.85166770
[55] -0.08258005 -0.07402357 0.04608886 1.23755538 0.13947027 -0.26674647
[61] -1.63342247 -1.27017623 -0.48908236 -1.17156518 -1.23908137 -1.56023483
[67] -0.38661179 -0.04171302 1.06463812 0.39548166 1.70006085 0.86015641
[73] -1.30233093 -0.35860879 -0.51457139 -0.59019298 -0.69354791 0.02279567
[79] -1.73141073 -1.35576059 0.59038794 1.90176456 1.35120487 0.12500529
[85] 0.32950977 1.03755141 -0.43439550 -0.05143596 -1.21871698 0.60413142
[91] -0.24240672 0.28854010 -0.10426365 0.96035056 -0.62558064 0.38943461
[97] 1.06728259 -0.38819202 1.19022687 -0.42156942
Finally, the hist() function takes some data and plots a histogram. I’m telling R “create a histogram from the contents of variable x.”
[1] -1.297449689 -0.221671545 -0.297232552 -2.495615938 -3.762765396
[6] -0.189378111 -0.588306635 -0.524049441 -0.201615410 -1.441120106
[11] 0.285905045 -0.795397458 1.581321577 0.944774916 -0.307963257
[16] -0.548594764 -0.747103619 2.123064771 1.217030366 -1.137125842
[21] -1.378033757 -0.981581786 0.327595631 0.116081185 -0.626471502
[26] -0.308873446 -0.227739425 0.191857637 1.456165220 -1.532970715
[31] -1.277447634 -0.676073773 1.651042515 -0.222105947 1.455989702
[36] -1.418709428 -1.486234669 0.976058699 -0.005524805 1.207668067
[41] 0.334014408 -0.651475721 0.450292384 -1.366668804 -1.619994356
[46] -0.640594284 0.530293222 1.290997596 0.941906651 0.375794413
[51] 0.602470918 1.032370798 1.591679645 1.933317016 -0.200515479
[56] 0.779245627 2.054436243 -0.392578563 0.646180608 -0.328942519
[61] -0.320480026 -0.240775049 0.640760166 -0.483589489 0.341830146
[66] 0.027008367 -1.602438671 0.527182650 -0.644651050 -1.042110786
[71] -0.123270328 1.906187211 0.420165958 0.960492674 -0.147068792
[76] 0.590968441 -0.903510911 1.625800320 0.296693967 -0.415044480
[81] 1.439818214 -0.830440687 0.455946404 -0.060058874 -1.655123372
[86] 0.679813723 0.787324911 -0.375429398 -1.508245713 -0.599408158
[91] -0.378771866 2.660613295 -0.273290521 0.022507875 -0.170522516
[96] -0.272498303 -0.015068304 -0.014084658 0.545675374 0.906446632
Now, with the same script open, press CTRL + SHIFT + S
What happened?
Do the following:
Close RStudio
When asked if you would like to save the Workspace Image, click “No” (and ideally never click “Yes” ever from now until the end of time)
Re-open RStudio
Try to run just the last line of the code again:
What happened?
The data in R’s global environment goes away when you restart R.
You can’t evaluate or run functions on x until you create x. The order of commands matters.
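As a rough sketch of why order matters, the snippet below simulates a fresh session by removing x, captures the error that hist(x) raises, and then runs the commands in the right order. The tryCatch wrapper is only there so the script keeps running; it is not part of the original exercise.

```r
# Simulate a fresh session: make sure no variable named x exists here.
if (exists("x", inherits = FALSE)) rm(x)

# Running the last line on its own fails because x has not been created yet.
msg <- tryCatch(
  {
    hist(x)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Creating x first and then plotting works.
x <- rnorm(100)
hist(x)
```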
But scripts are smaller, easier to share, easier to replicate, and easier to re-use than data itself. So try to store scripts rather than making new copies of data (when feasible)
R packages extend the basic functionality of R so you can do more stuff more easily.
To use a package, you first need to:
- Install it (just once!) using the `install.packages("packagename")` function or the RStudio interface.
- Then load it using the `library(packagename)` command each time you open R.
Install the tidyverse package by running (note the quotation marks):
Or use the graphical interface
Load the Snake package by running (note the lack of quotation marks)
Now you can use commands from the Snake package. Try running this:
Admittedly this is not a typical use-case for this tool…
You can use the Import Dataset menu to import data into R with point-and-click commands OR by writing code.
We’ll walk through both approaches using this data set as an example:
https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv
This data set comes from the article “A Statistical Analysis of the Work of Bob Ross” by Walt Hickey.
(We’ll walk through this in class, but as a reminder: you can access this by clicking the “Import Dataset” button in the top right pane, selecting “From text (readr)” from the menu, and pasting the data URL into the bar at the top.)
After loading the data with point-and-click commands, we should copy the code that produced it and save it in our script. That way, we can easily replicate or share our analysis by just re-running the script.
# load the readr package
library(readr)
# use the read_csv function to read the data
elements_by_episode <- read_csv(
"https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv"
)
# optionally, use View() to examine the data in the GUI
#View(elements_by_episode)
Note the library function: the readr package is what allows us to use the read_csv function, which is not a built-in command.
The View() function is handy, but it can be kind of annoying because it makes a pop-up, so I placed a # in front of this command to turn it into a comment.
Click the data set name in the environment window to view it
What is the unit of analysis?
How many observations are there (rows)? How many variables (columns)? How are things represented?
In the source window, start typing the name of the data set (elements_by_episode). What happens?
We’ve already used the <- operator to store a value and reference it by name. We call this process assignment.
Assignment is very flexible. You can change the value of an existing variable, you can make a copy with a new name, and you can even combine multiple variables into one or add them together.
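A short sketch of these kinds of assignment; the variable names score, bonus, and so on are made up for illustration.

```r
score <- 10              # create a variable
score <- score + 5       # change the value of an existing variable (now 15)
score_copy <- score      # make a copy with a new name
bonus <- 3
both <- c(score, bonus)  # combine multiple variables into one
total <- score + bonus   # add them together (18)
```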
Variable names must start with either a letter or a period. They usually can’t contain spaces or certain special operators like + or -, but you can use these characters if you wrap the name in ` symbols:
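For example, the following names would normally be rejected, but work when wrapped in backticks. Both names are hypothetical and chosen only to show the syntax.

```r
# a name containing a space
`my variable` <- 42
`my variable` + 1

# a name containing a minus sign
`a-b` <- c(1, 2, 3)
length(`a-b`)
```

Names like these are legal but awkward to type, which is one reason to prefer plain descriptive names.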
A good practice is to give descriptive names to variables and use underscores (_) in place of spaces.
In addition to storing data, R variables can have additional attributes and classes that impact how they’re stored, modified, or used in functions.
One of the most fundamental attributes is a variable’s mode, which is how R knows how to do things like distinguish numbers from text.
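As a quick illustration, the built-in mode() function reports this attribute for any value. The example values here are arbitrary.

```r
number_mode <- mode(3.14)    # "numeric"
text_mode <- mode("hello")   # "character"
logical_mode <- mode(TRUE)   # "logical"

print(c(number_mode, text_mode, logical_mode))
```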
We’ve already seen numeric data. But two other very common ones are:
character, which is used for storing text and can be created by entering values inside quotation marks.
logical, which is used for storing TRUE or FALSE values.
What happens when you run this? And why?
Data structures allow us to store and perform calculations on groups of numbers or text. The ones we’ll see most often are vector, matrix, data frame and list. We’ll talk briefly about each.
Vectors store multiple elements of the same type. You can create a vector by passing a comma-separated list to the c() function.
The elements of a vector must share a type. If they don’t, then R will “coerce” each element to make them conform.
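A small sketch of both points, using made-up values: the first vector is all numeric, while the second mixes types and gets coerced to character.

```r
numbers <- c(1, 2, 3)
mode(numbers)   # "numeric"

# Mixing a number, text, and a logical value forces everything to text.
mixed <- c(1, "two", TRUE)
mixed           # "1" "two" "TRUE"
mode(mixed)     # "character"
```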
We can use the [] operator to access specific elements of a vector. For instance, I can get the 2nd element of this vector by writing:
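For instance, with a hypothetical vector named fruits:

```r
fruits <- c("apple", "banana", "cherry", "date")
second <- fruits[2]
second   # "banana"
```

Note that R counts positions starting from 1, not 0.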
We can also use vectors to subset other vectors
And we can use a logical comparison to subset.
This comparison creates a logical vector:
And this uses a logical vector to subset another vector:
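The three kinds of subsetting described above can be sketched as follows. The vector values and the cutoff of 10 are arbitrary choices for illustration.

```r
values <- c(12, 5, 30, 8, 21)

# Use a vector of positions to subset another vector.
picked <- values[c(1, 3)]   # 12 30

# A comparison creates a logical vector.
is_big <- values > 10       # TRUE FALSE TRUE FALSE TRUE

# A logical vector keeps only the elements where it is TRUE.
big_values <- values[is_big]   # 12 30 21
```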
A matrix has a single type of data arranged in a fixed number of rows and columns. Here’s a matrix with 3 columns and 5 rows.
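One way to build such a matrix is with the matrix() function. The contents (the integers 1 through 15) are an assumption, since any 15 values of a single type would do.

```r
# 15 values arranged into 5 rows and 3 columns, filled column by column
m <- matrix(1:15, nrow = 5, ncol = 3)
m
dim(m)   # 5 3
```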
“Under the hood” a matrix is really just a vector with some extra attributes, so I can subset it just like I would a vector:
However, certain functions only make sense for matrices:
It usually makes more sense to take an entire row or column from a matrix. To do that, I can use syntax like this:
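A brief sketch of the matrix operations discussed above, assuming a matrix m filled with the integers 1 through 15:

```r
m <- matrix(1:15, nrow = 5, ncol = 3)

# Vector-style subsetting counts down the columns.
seventh <- m[7]        # 7

# Functions that only make sense for matrices.
dim(m)                 # 5 3
transposed <- t(m)     # 3 rows, 5 columns

# Leave one side of the comma empty to take a whole row or column.
second_row <- m[2, ]   # 2 7 12
third_col <- m[, 3]    # 11 12 13 14 15
one_cell <- m[2, 3]    # 12
```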
Data frames have rows and columns like a matrix, but:
Columns can be accessed by name using data$colname notation. You can still use matrix-style indexing, though!
Use the $ operator to access entire columns (but not rows!)
Or you can use double brackets followed by a column name in quotation marks
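As an illustration, here is a tiny data frame with invented values (the titles and 0/1 flags are made up and do not come from the Bob Ross data set), accessed in each of these ways:

```r
paintings <- data.frame(
  title = c("Mountain Lake", "Winter Cabin", "Quiet Stream"),
  trees = c(1, 1, 0),
  clouds = c(0, 1, 1)
)

tree_col <- paintings$trees          # entire column with $
cloud_col <- paintings[["clouds"]]   # entire column with double brackets
second_row <- paintings[2, ]         # matrix-style indexing for a row
one_cell <- paintings[2, "clouds"]   # a single value
```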
You’re likely to encounter data frames more than any other type of data. However, many statistical operations will coerce your data to a vector or matrix before actually conducting the analysis.
Lists are like data frames without any of the restrictions. They can contain any number of types and can even contain other lists or data frames:
mylist <- list(
  "letters" = c("A", "B", "C"),
  "scalar" = 10,
  "nested_list" = list(
    "palette_1" = list("red", "blue", "green", "white"),
    "palette_2" = list("pink", "brown", "black")
  )
)
mylist
$letters
[1] "A" "B" "C"
$scalar
[1] 10
$nested_list
$nested_list$palette_1
$nested_list$palette_1[[1]]
[1] "red"
$nested_list$palette_1[[2]]
[1] "blue"
$nested_list$palette_1[[3]]
[1] "green"
$nested_list$palette_1[[4]]
[1] "white"
$nested_list$palette_2
$nested_list$palette_2[[1]]
[1] "pink"
$nested_list$palette_2[[2]]
[1] "brown"
$nested_list$palette_2[[3]]
[1] "black"
You can also access parts of a list using the $ or [[]] operators.
And you can also access parts of a nested list:
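Using the same list as above (repeated here so the snippet runs on its own), access might look like this:

```r
mylist <- list(
  "letters" = c("A", "B", "C"),
  "scalar" = 10,
  "nested_list" = list(
    "palette_1" = list("red", "blue", "green", "white"),
    "palette_2" = list("pink", "brown", "black")
  )
)

# Top-level elements with $ or [[ ]]
top_letters <- mylist$letters     # "A" "B" "C"
top_scalar <- mylist[["scalar"]]  # 10

# Elements of a nested list: chain the operators
second_colour <- mylist$nested_list$palette_1[[2]]             # "blue"
first_colour <- mylist[["nested_list"]][["palette_2"]][[1]]    # "pink"
```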
You generally won’t use lists for your analysis because you can’t really summarize their contents easily. But you’ll still see them used:
As the objects returned by R functions (such as the lm() command) that need to return a bunch of information.
For storing or transporting complex data (such as information used by web services) that can’t fit neatly into a data frame.
Using the Bob Ross data frame from earlier:
how would you find all episodes where Ross painted trees? How many are there?
How would you find all of the elements he painted in episode 1?
The “GUEST” column contains a 1 if an episode had a guest host and 0 otherwise. How would you remove all of these so you only had Bob Ross paintings?
Functions are basically just blocks of generalized code that can be used over and over again. We’ve already used several:
- c() concatenates values into a vector
- matrix() turns a vector into a matrix
- data.frame() turns a series of equal-length vectors into a data frame
- mode() tells you the type of a particular kind of data

We’ve also used operators like + and -. In actuality, these are also a kind of R function.
“Infix” functions are so obvious you probably don’t even think of them as functions. Try typing this into the script editor and then send it to R:
Infix operators will always take a left hand side (LHS) argument and a right hand side argument (RHS)
LHS + RHS
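To see that an infix operator really is a function, you can call it by name by wrapping it in backticks. This is a small sketch with arbitrary numbers, not necessarily the example used in class.

```r
infix_result <- 1 + 2        # usual infix form: LHS + RHS
prefix_result <- `+`(1, 2)   # the same function called like any other

identical(infix_result, prefix_result)   # TRUE
```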
Arithmetic
| Operator | Usage | Example |
|---|---|---|
| +, -, /, * | plus, minus, divide, and multiply, respectively | 1 + 12 / 3 |
| %%, %/% | modulo division and integer division | |
| ^ | raise the left hand side to the power of the right hand side | 3 ^ 4 |
| == | test for equality | |
| != | test for inequality | "Horse" != "Donkey" |
| >, <, >=, <= | greater than, less than, greater than or equal to, less than or equal to | |
| <-, = | assign RHS to LHS (<- is preferred over the equals sign because it’s less likely to be confused for a comparison) | x <- 13 |
| : | make a sequence of numbers from LHS to RHS | 1:10 |
| [], [[]], $ | subsetting (get an entire row or a column or single observation) | x[1], cars[4, ], iris$Sepal.Length |
| ~ | used in model formulas (like when we want to estimate the effect of a variable on an outcome) | lm(speed ~ dist, data = cars) |
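A few of these operators in action. The numbers are arbitrary and only meant to show what each operator returns.

```r
remainder <- 7 %% 3              # modulo: 1
quotient <- 7 %/% 3              # integer division: 2
power <- 3 ^ 4                   # 81
is_equal <- 2 + 2 == 4           # TRUE
differs <- "Horse" != "Donkey"   # TRUE
at_least <- 5 >= 5               # TRUE
one_to_ten <- 1:10               # 1 2 3 ... 10
```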
Most R functions will use prefix notation. Use a prefix function by first writing the name of the function, followed by parentheses with some arguments inside.
The help function is a special function that brings up information about functions. Run this and see what shows up in the bottom right window of RStudio.
Let’s take a closer look at the Default S3 method here for the mean() function.
x is the main input.
The two remaining arguments (trim and na.rm) have a name AND default values. This means we don’t have to write out these arguments every time.
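You can also inspect a function’s arguments and defaults from code. This sketch uses the base functions args() and formals() on mean.default, which is the default S3 method the help page describes.

```r
# Print the argument list of the default method for mean()
args(mean.default)

# formals() returns the arguments as a named list, including defaults
defaults <- formals(mean.default)
names(defaults)   # "x" "trim" "na.rm" "..."
defaults$trim     # 0
defaults$na.rm    # FALSE
```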
Try running this:
[1] NA
Now try this:
[1] 3.4
Note
R is pedantic about math! There really isn’t a valid way to calculate the mean with NA values, and NA is supposed to be a placeholder, so R doesn’t want to ignore it by default. So you’re forced to be explicit.
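A sketch of this behaviour follows. The vector is hypothetical: its values were picked only so that the results match the two outputs shown above, and the original example may have used different numbers.

```r
scores <- c(1, 2, 3, 5, 6, NA)

# With the default na.rm = FALSE, a single NA makes the result NA.
with_na <- mean(scores)

# Being explicit about dropping missing values gives a number.
without_na <- mean(scores, na.rm = TRUE)

print(with_na)      # NA
print(without_na)   # 3.4
```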
One more thing to note here is that we can name each argument explicitly, but we don’t have to:
Now try without specifying the name x:
If we don’t name the arguments, R will assume they’re being given in the order specified in the help file. This can cause problems if we get things out of sequence. Which is why this is okay:
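To illustrate, the calls below all give the same answer. The vector scores is a made-up example; the point is that named arguments can appear in any order, while unnamed ones are matched by position.

```r
scores <- c(1, 2, 3, 5, 6, NA)

# Every argument named explicitly
all_named <- mean(x = scores, na.rm = TRUE)

# First argument matched by position, second by name
first_unnamed <- mean(scores, na.rm = TRUE)

# Named arguments can be given out of order
reordered <- mean(na.rm = TRUE, x = scores)
```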
You can create your own functions. To create a function, you’ll use the following general syntax:
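As a minimal sketch, the general pattern is a name, the function keyword, a list of arguments (optionally with defaults), and a body in curly braces. The function add_numbers and its arguments are hypothetical and used only to show the shape.

```r
# name <- function(arguments) { body }
add_numbers <- function(a, b = 1) {
  result <- a + b
  return(result)
}

add_numbers(2, 3)   # 5
add_numbers(2)      # 3, because b falls back to its default of 1
```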
Now execute your function just like you would any other R function.
Note: built-in functions like sum() will be available as soon as you load R, but user-created functions will just be loaded into the global environment, so they go away when you close R.
Use help() to get information about an R function.
Find the top 10 most common elements Bob Ross painted and create a barplot showing the frequency of each.
Start by writing out the steps you need to perform as comments:
# Step 0. import the data (you should already have the code for this)
# Step 1. remove the columns with episode numbers or titles
# Step 2. get the sum of the numeric columns (there's a way to do this with just one function!)
# Step 3. sort the sums from highest to lowest
# Step 4. take the top 10 elements from this sorted result
# Step 5. create a bar plot from the object created in step 4
Error messages are informative! Even if they seem inscrutable.
When something doesn’t work, stop, take a breath, and then read the output.
Consult the help files
Search the internet! Few problems are unique and R has a lot of users.
It’s pointless to try to memorize everything. Instead, try to follow good script-writing practices:
Give objects meaningful names
Write lots of comments
Save your scripts (also with meaningful names)
Use the GUI to figure stuff out, but write the equivalent commands in your script
Don’t try to tackle complex problems in one fell swoop. Take a couple of observations, figure out the correct answer, and then try to write code that gets you there. Then generalize that case to the entire data set.
Talk to yourself: use the comments to outline what you will do before you do it.
Ask for help from your classmates or instructors! You’ll get better help if you share enough code or data to allow someone else to replicate your problem easily.